Using R and Longitudinal Data System Records to Answer Policy Questions

Jared Knowles

Overview

Why R?

R

Google Scholar Hits

R has recently passed Stata in Google Scholar hits, and it is catching up to the two major players, SPSS and SAS.

R Has an Active Web Presence

R is linked to from more and more sites

R Extensions

These links come from the explosion of add-on packages to R

R Has an Active Community

Use of the R help listserv has grown sharply in recent years.

R Examples

Read in Data

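# Load the student records; stringsAsFactors = TRUE keeps proflvl and race
# as factors (as in the output below) on R >= 4.0, where the default changed
studat <- read.csv('data/smalldata.csv', stringsAsFactors = TRUE)
str(studat[5:18, ])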
## 'data.frame':    14 obs. of  32 variables:
##  $ X          : int  274 276 478 574 613 620 717 772 1004 1056 ...
##  $ school     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stuid      : int  142705 14995 120205 103495 55705 28495 37705 52705 41995 10705 ...
##  $ grade      : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ schid      : int  205 495 205 495 205 495 205 205 495 205 ...
##  $ dist       : int  75 105 15 45 75 45 75 75 105 75 ...
##  $ white      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ black      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hisp       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ indian     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ asian      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ econ       : int  1 1 1 1 1 0 1 0 1 1 ...
##  $ female     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ell        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ disab      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sch_fay    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ dist_fay   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ luck       : int  0 0 1 0 0 0 0 1 0 0 ...
##  $ ability    : num  81.9 101.9 87.3 96.6 98.4 ...
##  $ measerr    : num  52.98 22.6 4.67 -9.35 -7.7 ...
##  $ teachq     : num  56.68 71.62 66.88 75.21 4.95 ...
##  $ year       : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ attday     : int  156 157 169 180 170 152 162 180 152 165 ...
##  $ schoolscore: num  56 56 56 56 56 ...
##  $ district   : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ schoolhigh : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolavg  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ schoollow  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ readSS     : num  373 437 418 454 310 ...
##  $ mathSS     : num  441 463 436 434 284 ...
##  $ proflvl    : Factor w/ 4 levels "advanced","basic",..: 2 4 4 4 3 2 4 4 2 3 ...
##  $ race       : Factor w/ 5 levels "A","B","H","I",..: 2 2 2 2 2 2 2 2 2 2 ...
# source('data/simulate_data.R')

Simple Diagnostics

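source('ggplot2themes.R')  # defines theme_dpi()
library(ggplot2)
# Reading vs. math scale scores, with a separate smoother for each ELL group
# (ggplot() replaces qplot(), which is deprecated in current ggplot2)
ggplot(studat, aes(readSS, mathSS)) +
  geom_point(alpha = 0.2) +
  geom_smooth(aes(group = ell, color = factor(ell))) +
  theme_dpi()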

Advanced Diagnostics

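# Draw 24 distinct students (stuid repeats across years, so sample unique IDs)
samp <- sample(unique(studat$stuid), 24)
plotsub <- subset(studat, stuid %in% samp)
# One panel per student: observed reading scores by grade plus a linear trend
ggplot(plotsub, aes(grade, readSS)) +
  geom_point() + geom_line() +
  geom_smooth(method = 'lm', se = FALSE) +
  facet_wrap(~stuid, nrow = 4, ncol = 6) + theme_dpi()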

More Advanced

What Do We Do?

Evaluations of Policy

Results

Conclusions and Next Steps

Conclusions

Next Steps

Inference Trees

Outline of the conditional inference tree structure.

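library(partykit)
# Bonferroni-adjusted tests; split only when 1 - p-value exceeds 0.99
mypar <- ctree_control(testtype = 'Bonferroni', mincriterion = 0.99)
# Pass the control object to ctree(); otherwise mypar is never used
mytree <- ctree(mathSS ~ race + econ + ell + disab + sch_fay + dist_fay + attday + readSS,
                data = subset(studat, grade == 3), control = mypar)
plot(mytree)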
[Figure: plot of the conditional inference tree (chunk parttree)]

Can Standardize and Share / Compare Results

Code collaboration

Some code sharing exists

Can do more

##   sid school_year race_ethnicity
## 1   1        2004              B
## 2   1        2005              H
## 3   1        2006              H
## 4   1        2007              H
## 5   2        2006              W
## 6   2        2007              B

What business rules do we use?

What to do?

head(stuatt,4)
##   sid school_year race_ethnicity
## 1   1        2004              B
## 2   1        2005              H
## 3   1        2006              H
## 4   1        2007              H

Fix the data

# Copy the reported race, then overwrite the first record by hand
stuatt$race2 <- stuatt$race_ethnicity
stuatt$race2[1] <- "H"
head(stuatt, 4)
##   sid school_year race_ethnicity race2
## 1   1        2004              B     H
## 2   1        2005              H     H
## 3   1        2006              H     H
## 4   1        2007              H     H
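
The manual fix above handles one record. One way such a fix could be generalized is to write the business rule as a function and apply it to every student. The sketch below is an assumption, not the rule used in the original analysis: it keeps the most frequently reported code and breaks ties in favor of the most recent year. The function name modal_race and the toy data are hypothetical.

```r
# Toy data retyped from the printed records earlier in the deck
stuatt <- data.frame(
  sid            = c(1, 1, 1, 1, 2, 2),
  school_year    = c(2004, 2005, 2006, 2007, 2006, 2007),
  race_ethnicity = c("B", "H", "H", "H", "W", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical rule: most frequent code wins; ties go to the latest year
modal_race <- function(race, year) {
  counts <- table(race)
  top <- names(counts)[counts == max(counts)]
  if (length(top) == 1) return(top)
  tied <- race %in% top
  race[tied][which.max(year[tied])]
}

# Apply the rule within each student, then copy the result to every row
rule <- sapply(split(stuatt, stuatt$sid),
               function(d) modal_race(d$race_ethnicity, d$school_year))
stuatt$race2 <- unname(rule[as.character(stuatt$sid)])
stuatt
```

Writing the rule as a function makes the choice explicit and repeatable, so a different tie-break (for example, the earliest report) can be swapped in and the results compared.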

Do analytics on fixed data